AITopics | chain-of-thought prompting

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Anglin, Kylie L., Milan, Stephanie, Hernandez, Brittney, Ventura, Claudia

Improving Alignment Between Human and Machine Codes: An Empirical Assessment of Prompt Engineering for Construct Identification in Psychology

arXiv.org Artificial IntelligenceDec-4-2025

Due to their architecture and vast pre-training data, large language models (LLMs) demonstrate strong text classification performance. However, LLM output - here, the category assigned to a text - depends heavily on the wording of the prompt. While literature on prompt engineering is expanding, few studies focus on classification tasks, and even fewer address domains like psychology, where constructs have precise, theory-driven definitions that may not be well represented in pre-training data. We present an empirical framework for optimizing LLM performance for identifying constructs in texts via prompt engineering. We experimentally evaluate five prompting strategies --codebook-guided empirical prompt selection, automatic prompt engineering, persona prompting, chain-of-thought reasoning, and explanatory prompting - with zero-shot and few-shot classification. We find that persona, chain-of-thought, and explanations do not fully address performance loss accompanying a badly worded prompt. Instead, the most influential features of a prompt are the construct definition, task framing, and, to a lesser extent, the examples provided. Across three constructs and two models, the classifications most aligned with expert judgments resulted from a few-shot prompt combining codebook-guided empirical prompt selection with automatic prompt engineering. Based on our findings, we recommend that researchers generate and evaluate as many prompt variants as feasible, whether human-crafted, automatically generated, or ideally both, and select prompts and examples based on empirical performance in a training dataset, validating the final approach in a holdout set. This procedure offers a practical, systematic, and theory-driven method for optimizing LLM prompts in settings where alignment with expert judgment is critical.

large language model, machine learning, natural language, (18 more...)

2512.03818

Country:

North America > United States > Connecticut (0.04)
Asia > India (0.04)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Education (1.00)
Health & Medicine > Therapeutic Area > Psychiatry/Psychology (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.68)

Albers, Nele, de Groot, Esra Cemre Su, Keijsers, Loes, Hillegers, Manon H., Krahmer, Emiel

Can we use LLMs to bootstrap reinforcement learning? -- A case study in digital health behavior change

arXiv.org Artificial IntelligenceNov-25-2025

Personalizing digital applications for health behavior change is a promising route to making them more engaging and effective. This especially holds for approaches that adapt to users and their specific states (e.g., motivation, knowledge, wants) over time. However, developing such approaches requires making many design choices, whose effectiveness is difficult to predict from literature and costly to evaluate in practice. In this work, we explore whether large language models (LLMs) can be used out-of-the-box to generate samples of user interactions that provide useful information for training reinforcement learning models for digital behavior change settings. Using real user data from four large behavior change studies as comparison, we show that LLM-generated samples can be useful in the absence of real data. Comparisons to the samples provided by human raters further show that LLM-generated samples reach the performance of human raters. Additional analyses of different prompting strategies including shorter and longer prompt variants, chain-of-thought prompting, and few-shot prompting show that the relative effectiveness of different strategies depends on both the study and the LLM with also relatively large differences between prompt paraphrases alone. We provide recommendations for how LLM-generated samples can be useful in practice.

large language model, machine learning, natural language, (17 more...)

2511.1763

Country:

Europe > Austria > Vienna (0.14)
Europe > Netherlands > South Holland > Rotterdam (0.04)
North America > United States > New York > New York County > New York City (0.04)
(11 more...)

Genre: Research Report > Experimental Study (1.00)

Industry:

Health & Medicine > Therapeutic Area > Psychiatry/Psychology (1.00)
Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Health & Medicine > Consumer Health (1.00)
(3 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)

Fröhlich, Thorsten, Schlippe, Tim

RubiSCoT: A Framework for AI-Supported Academic Assessment

arXiv.org Artificial IntelligenceNov-24-2025

The evaluation of academic theses is a cornerstone of higher education, ensuring rigor and integrity. Traditional methods, though effective, are time-consuming and subject to evaluator variability. This paper presents RubiSCoT, an AI-supported framework designed to enhance thesis evaluation from proposal to final submission. Using advanced natural language processing techniques, including large language models, retrieval-augmented generation, and structured chain-of-thought prompting, RubiSCoT offers a consistent, scalable solution. The framework includes preliminary assessments, multidimensional assessments, content extraction, rubric-based scoring, and detailed reporting. We present the design and implementation of RubiSCoT, discussing its potential to optimize academic assessment processes through consistent, scalable, and transparent evaluation.

large language model, machine learning, natural language, (17 more...)

2510.17309

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
North America > United States > Oregon > Multnomah County > Portland (0.04)
Europe > Germany > North Rhine-Westphalia > Cologne Region > Cologne (0.04)

Genre:

Instructional Material (1.00)
Overview (0.95)
Research Report > Experimental Study (0.69)

Industry:

Education > Educational Technology > Educational Software > Computer-Aided Assessment (1.00)
Education > Educational Setting (1.00)
Education > Assessment & Standards (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Neural Information Processing SystemsNov-20-2025, 06:26:19 GMT

VLM4Bio: A Benchmark Dataset to Evaluate Pretrained Vision-Language Models for Trait Discovery from Biological Images Supplementary Materials

Fish-10K was manually curated with the help of biological experts co-authored in this paper.

large language model, machine learning, natural language, (18 more...)

Country:

North America > United States > Ohio (0.04)
Oceania > New Zealand (0.04)
Oceania > Cook Islands (0.04)
(5 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.76)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.36)

Neural Information Processing SystemsOct-10-2025, 20:37:28 GMT

VLM4Bio: A Benchmark Dataset to Evaluate Pretrained Vision-Language Models for Trait Discovery from Biological Images Supplementary Materials

Fish-10K was manually curated with the help of biological experts co-authored in this paper.

dataset, incorrect prediction, prediction, (12 more...)

Country:

North America > United States > Ohio (0.04)
Oceania > New Zealand (0.04)
Oceania > Cook Islands (0.04)
(5 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.76)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.36)

Neural Information Processing SystemsOct-9-2025, 19:55:07 GMT

19e4ea30dded58259665db375885e412-Paper-Datasets_and_Benchmarks_Track.pdf

chain-of-thought prompting, commonsense knowledge, obscure typographical measurement system, (16 more...)

Country:

North America > United States > Texas > Travis County > Austin (0.27)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.14)
(31 more...)

Genre:

Research Report > New Finding (0.92)
Research Report > Experimental Study (0.67)

Industry:

Law (1.00)
Information Technology > Security & Privacy (1.00)
Government (1.00)
(2 more...)

Technology:

Information Technology > Software > Programming Languages (1.00)
Information Technology > Security & Privacy (1.00)
Information Technology > Data Science > Data Quality (1.00)
(11 more...)

Neural Information Processing SystemsAug-17-2025, 06:40:04 GMT

9d5609613524ecf4f15af0f7b31abca4-Supplemental-Conference.pdf

exemplar, large language model, machine learning, (17 more...)

Country:

North America > United States > Pennsylvania (0.04)
Asia > Vietnam (0.04)
North America > Mexico (0.04)

Industry:

Media (0.93)
Consumer Products & Services (0.67)
Leisure & Entertainment > Sports > Basketball (0.46)
Leisure & Entertainment > Sports > Football (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.69)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

Neural Information Processing SystemsAug-17-2025, 06:39:59 GMT

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

The empirical gains can be striking.

large language model, machine learning, natural language, (16 more...)

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Asia > Middle East > Jordan (0.04)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Generation (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Shafiei, Mohammadamin, Saffari, Hamidreza, Moosavi, Nafise Sadat

More or Less Wrong: A Benchmark for Directional Bias in LLM Comparative Reasoning

arXiv.org Artificial IntelligenceJun-5-2025

Large language models (LLMs) are known to be sensitive to input phrasing, but the mechanisms by which semantic cues shape reasoning remain poorly understood. We investigate this phenomenon in the context of comparative math problems with objective ground truth, revealing a consistent and directional framing bias: logically equivalent questions containing the words ``more'', ``less'', or ``equal'' systematically steer predictions in the direction of the framing term. To study this effect, we introduce MathComp, a controlled benchmark of 300 comparison scenarios, each evaluated under 14 prompt variants across three LLM families. We find that model errors frequently reflect linguistic steering, systematic shifts toward the comparative term present in the prompt. Chain-of-thought prompting reduces these biases, but its effectiveness varies: free-form reasoning is more robust, while structured formats may preserve or reintroduce directional drift. Finally, we show that including demographic identity terms (e.g., ``a woman'', ``a Black person'') in input scenarios amplifies directional drift, despite identical underlying quantities, highlighting the interplay between semantic framing and social referents. These findings expose critical blind spots in standard evaluation and motivate framing-aware benchmarks for diagnosing reasoning robustness and fairness in LLMs.

computational linguistic, large language model, natural language, (16 more...)